Spark: Document broadcast size-map scaling in rewrite table action #17023

Closed
szehon-ho wants to merge 1 commit into apache:main from szehon-ho:rewrite-table-delete-size-broadcast-comment

Conversation

@szehon-ho

Member

What changes are included in this PR?

Follow-up to #15470. That PR introduced a per-delete-file size map that is collected to the driver and broadcast to the manifest-rewrite tasks of RewriteTablePathSparkAction. During review I asked whether the broadcast map's memory footprint had been considered for delete-heavy tables (review comment).

This adds a short comment near the broadcast site documenting the scaling assumption: the map holds one entry per distinct rewritten delete-file location, so its footprint scales with the number of distinct delete files being rewritten rather than the table's total file count. No behavior change.
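The scaling assumption can be illustrated with a small, self-contained sketch. This is not the code in RewriteTablePathSparkAction: the class `DeleteSizeMapSketch`, the type `RewrittenDeleteFile`, both methods, and the per-entry overhead constant are hypothetical names and values chosen for illustration, and the footprint estimate is deliberately crude. It only shows that a location-to-size map ends up with one entry per distinct delete-file location, so repeated references to the same file do not grow it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Iceberg implementation.
class DeleteSizeMapSketch {

  // Hypothetical holder for one rewritten delete file: new location and size in bytes.
  static final class RewrittenDeleteFile {
    final String location;
    final long sizeInBytes;

    RewrittenDeleteFile(String location, long sizeInBytes) {
      this.location = location;
      this.sizeInBytes = sizeInBytes;
    }
  }

  // Builds a location -> size map. Repeated locations overwrite the same key,
  // so the map has one entry per distinct delete-file location.
  static Map<String, Long> buildSizeMap(List<RewrittenDeleteFile> rewritten) {
    Map<String, Long> sizes = new HashMap<>();
    for (RewrittenDeleteFile file : rewritten) {
      sizes.put(file.location, file.sizeInBytes);
    }
    return sizes;
  }

  // Crude footprint estimate: an assumed fixed overhead per entry plus one byte
  // per character of the location string. Both figures are assumptions, not
  // measurements of a real JVM or of a serialized Spark broadcast.
  static long estimateBytes(Map<String, Long> sizes, long assumedOverheadPerEntry) {
    long total = 0;
    for (String location : sizes.keySet()) {
      total += assumedOverheadPerEntry + location.length();
    }
    return total;
  }

  public static void main(String[] args) {
    // Three references, but only two distinct delete-file locations.
    List<RewrittenDeleteFile> rewritten = List.of(
        new RewrittenDeleteFile("s3://bucket/new/delete-1.parquet", 1024L),
        new RewrittenDeleteFile("s3://bucket/new/delete-2.parquet", 2048L),
        new RewrittenDeleteFile("s3://bucket/new/delete-1.parquet", 1024L));

    Map<String, Long> sizes = buildSizeMap(rewritten);
    System.out.println("entries: " + sizes.size());
    System.out.println("estimated bytes: " + estimateBytes(sizes, 100L));
  }
}
```

Under these assumptions the estimate grows linearly with the number of distinct rewritten delete files and their path lengths; the count of data files in the table does not enter the calculation.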

The comment is added identically to the v3.5, v4.0, and v4.1 Spark modules.

How are these changes tested?

Comment-only change; no functional change, so no new tests.


AI Disclosure

Follow-up to apache#15470. Add a comment explaining that the delete-file size
map collected to the driver and broadcast to the manifest-rewrite tasks
scales with the number of distinct delete files being rewritten, not the
table's total file count.

Generated-by: Cursor (Claude Opus 4.8)
github-actions bot added the spark label on Jun 30, 2026
@szehon-ho

Member Author

sorry spurious pr

szehon-ho closed this on Jun 30, 2026


1 participant